Project Overview: An investigation into loan data to gauge the relations, if any, between parameters of interest.
Note: As per the requirements of this project, each line of code below is limited to 80 characters or fewer.
All values are in dollars in terms of revenue.
Before creating any plots, we explore the dataset. We use names(), str(), dim() and head() to inspect its structure and retrieve the row and column counts, column names, etc.
## [1] 113937 81
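The inspection step can be sketched as follows. This is a minimal illustration, not the report's own code: the small `loans` data frame is a hypothetical stand-in for the full dataset, with column names borrowed from those that appear later in this report.

```r
# Hypothetical stand-in for the loan dataset (the real one has 113937 rows
# and 81 columns); the values are made up for illustration.
loans <- data.frame(
  LoanOriginalAmount = c(1000, 4000, 10000, 15000, 25000),
  Term               = c(12, 36, 36, 36, 60),
  EstimatedLoss      = c(0.12, 0.09, 0.07, 0.05, 0.03)
)

dims <- dim(loans)    # c(number of rows, number of columns)
cols <- names(loans)  # column names
str(loans)            # type and first few values of each column
head(loans, 3)        # first three rows
```

On the full dataset, dim() is what produces the row and column counts shown above.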
In this section, we perform some preliminary exploration of the loan dataset: we run summary statistics on the data and create univariate plots to understand the structure of the individual variables.
First, we analyze the income ranges of loan recipients. While income does not necessarily correlate with loan repayment success, it is a useful factor for segmenting the customer base. We use a simple bar plot of the number of records in each income bracket. The 25K-74.999K range contains the largest group of borrowers.
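A bar plot of a categorical variable is a plot of its frequency table. The sketch below uses base R and a handful of made-up records; the bracket labels are illustrative assumptions and do not reproduce the dataset's actual levels.

```r
# Hypothetical income brackets for six loans (labels are illustrative only).
income <- factor(
  c("25K-50K", "50K-75K", "25K-50K", "75K-100K", "50K-75K", "25K-50K"),
  levels = c("25K-50K", "50K-75K", "75K-100K")
)

income_counts <- table(income)  # number of records per bracket

# In a non-interactive run, draw to a null device so no file is written.
if (!interactive()) pdf(NULL)
barplot(income_counts,
        main = "Loans by income range",
        xlab = "Income range", ylab = "Number of loans")

largest <- names(which.max(income_counts))  # bracket with the most loans
```

Declaring the factor levels explicitly keeps the bars in bracket order rather than alphabetical order.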
Now, we analyze the estimated loss in each loan transaction. Since this is a numeric variable, we plot a histogram, which bins the values and draws the bin counts as bars. As is clear from the plot, this ratio is low.
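To illustrate how a histogram bins a numeric variable, here is a minimal base R sketch. The loss values and the bin width of 0.1 are assumptions chosen for the example, not figures from the dataset.

```r
# Hypothetical estimated-loss values (as ratios); made up for illustration.
est_loss <- c(0.02, 0.05, 0.07, 0.08, 0.12, 0.15, 0.22, 0.35)

# In a non-interactive run, draw to a null device so no file is written.
if (!interactive()) pdf(NULL)
h <- hist(est_loss,
          breaks = seq(0, 0.4, by = 0.1),  # four bins of width 0.1
          main = "Estimated loss",
          xlab = "Estimated loss", ylab = "Number of loans")

bin_counts <- h$counts  # number of loans falling in each bin
```

A distribution concentrated in the lowest bins, as in this toy data, is what "the ratio is low" looks like in a histogram.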
Next, we analyze the distribution of records across employment status brackets. This is an important metric: based on the results, any high-performing segment can be favored for future loan transactions. Groups with salaries (Employed and Full-time) form the major cohort of loan applicants. However, because categories such as Not Employed and Other are present in addition to null values, this is not a very useful standalone variable.
Another important consideration while approving a loan is the debt-to-income ratio. We study this using a bar plot. As a single variable it is not very insightful, but it does show that most applicants have a favorable debt-to-income ratio. This may suggest stringent checks in the loan application process.
Next, we explore the distribution of the original loan amount among the loan records. The principal sum, in conjunction with the debt-to-income ratio, can provide pointers on the success of loans. We study this using a bar plot. Most loan amounts fall in the low to mid range.
Next, we explore the distribution of loans across geographic locations to see if a pattern exists in loan requirements. We study this using a bar plot. It is interesting to note that some states have a very high number of borrowers while others have barely any. Of course, this observation is subject to the limitations of data sampling. Potential causes include uneven sampling of data and a lack of lenders and businesses.
Next, we explore the distribution of Credit Grades among the loan applicants. We study this using a bar plot. Since the majority of the loans have not been assigned a grade, this points to a good line of analytical scrutiny. It may be that the loan is at a stage where grades are not yet available, that the applicant has limited credit history, or that the loan applicant is a joint venture. In combination with loan status, this might yield better insights.
Next, we explore the distribution of loan terms among the loan applicants. We study this using a histogram. We can conclude that 36-month loans are the most popular. Our next iteration of analysis could examine whether this is a result of rates, offerings, or other contributing factors.
Next, we explore the distribution of monthly loan payments among the loan applicants. We study this using a bar plot.
Next, we explore the distribution of employment status among the loan applicants. We study this using a Cleveland dot chart.
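A Cleveland dot chart shows one dot per category, positioned by its count. The following is a minimal sketch using base R's dotchart(); the status values are made up and only loosely mirror the categories discussed in this report.

```r
# Hypothetical employment statuses for seven loans (illustrative only).
status <- c("Employed", "Full-time", "Employed", "Self-employed",
            "Employed", "Full-time", "Not employed")

# Sort the counts so the most common status ends up at the top of the chart.
status_counts <- sort(table(status))

# In a non-interactive run, draw to a null device so no file is written.
if (!interactive()) pdf(NULL)
dotchart(as.numeric(status_counts),
         labels = names(status_counts),
         main = "Loans by employment status",
         xlab = "Number of loans")

most_common <- names(status_counts)[length(status_counts)]
```

Sorting before plotting makes the ranking of categories easier to read than in an unsorted bar plot.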
Next, we explore the distribution of loan status among the loan applicants. We study this using a simple bar chart. Based on our plot, most loans are current or completed. This can bode well for the institution, depending on the financial gains from current loans. We also note that charged-off, defaulted and past-due loans are fewer in number, which is a positive factor.
Further, we explore the distribution of lender yield from the loan transactions to gauge feasibility. We study this using a simple bar chart. Lender yield, while not very high, mostly sits around the mid levels.
Further, we investigate what the typical borrower rates are. We study this using a simple bar chart to obtain typical rates for loans.
Notes: The bar plots and histograms above cover the categorical and numeric variables of interest from the loan dataset. What follows is a reflection on and summary of the univariate explorations, with answers to the project questions.
data.frame with 113937 obs. of 81 variables.
Thus, 113937 rows and 81 columns
LoanOriginalAmount vs EstimatedLoss
into your feature(s) of interest? > InquiriesLast6Months, DebtToIncomeRatio, IncomeRange, LoanStatus, BorrowerState, CreditGrade, Term
No
 of the data? If so, why did you do this? > No
Notes: Based on observations in the univariate plots, we explore some relationships between variables that might be interesting to look at in this section.
Here, we explore the relation, if any, between Loan Original Amount and Estimated Loss. We study this using a simple scatter plot. As is evident, the plot is roughly negatively sloped: as the loan amount increases, estimated losses fall. This is a good point for further investigation; we can check whether a correlation exists and whether it is causal. The relation hints at many plausible explanations, such as more thorough background checks for higher loan amounts, greater payment ability accompanying higher borrowing ability, more steps or stronger action to reclaim larger loan amounts, or simply a greater incentive to repay larger loans to avoid crippling compound interest.
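One way to check the direction of such a trend is to overlay a least-squares line on the scatter plot and read off its slope. The sketch below uses base R and made-up values; it illustrates the technique and does not reproduce the report's plot.

```r
# Hypothetical loan amounts (dollars) and estimated losses (ratios).
amount <- c(1000, 2000, 5000, 10000, 15000, 25000)
loss   <- c(0.15, 0.12, 0.10, 0.08, 0.06, 0.03)

fit   <- lm(loss ~ amount)        # simple linear regression
slope <- coef(fit)[["amount"]]    # change in loss per extra dollar lent

# In a non-interactive run, draw to a null device so no file is written.
if (!interactive()) pdf(NULL)
plot(amount, loss,
     main = "Loan Original Amount versus Estimated Loss",
     xlab = "Loan original amount", ylab = "Estimated loss")
abline(fit)                       # overlay the fitted line
```

A negative slope matches the downward pattern described above, but a fitted line by itself says nothing about whether the relation is causal.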
Here, we explore the relation, if any, between Debt To Income Ratio and Estimated Loss. We study this using a simple scatter plot. This is an interesting plot because the points are concentrated along the Y-axis. This, again, is a line of exploration worth pursuing further.
## $title
## [1] "Debt to Income Ratio versus Estimated Loss"
##
## $subtitle
## NULL
##
## attr(,"class")
## [1] "labels"
Here, we explore the relation, if any, between Income Range and Estimated Loss. We study this using a simple scatter plot. The losses are roughly equal across the income ranges.
##
## Pearson's product-moment correlation
##
## data: loanData$LoanOriginalAmount and loanData$EstimatedLoss
## t = -138.69, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4353596 -0.4243895
## sample estimates:
## cor
## -0.4298904
##
## Pearson's product-moment correlation
##
## data: loanData$DebtToIncomeRatio and loanData$EstimatedLoss
## t = 35.404, df = 77555, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1191835 0.1330353
## sample estimates:
## cor
## 0.1261155
##
## Pearson's product-moment correlation
##
## data: loanData$InquiriesLast6Months and loanData$EstimatedLoss
## t = 84.881, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2735462 0.2859500
## sample estimates:
## cor
## 0.2797598
##
## Pearson's product-moment correlation
##
## data: as.numeric(loanData$Term) and loanData$EstimatedLoss
## t = -31.39, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1137865 -0.1004841
## sample estimates:
## cor
## -0.1071401
##
## Pearson's product-moment correlation
##
## data: loanData$BorrowerAPR and loanData$EstimatedLoss
## t = 881.84, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9488713 0.9501952
## sample estimates:
## cor
## 0.9495375
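Output of this form comes from R's cor.test(), which reports the Pearson estimate together with a t statistic on n - 2 degrees of freedom, where n is the number of complete pairs. The differing df values above (84851 versus 77555) would be consistent with rows containing missing values being dropped pair by pair. A minimal sketch with made-up vectors:

```r
# Hypothetical vectors with some missing values (illustrative only).
amount <- c(1000, 2000, 5000, 10000, 15000, 25000, 8000, NA)
loss   <- c(0.15, 0.12, 0.10, 0.08, NA, 0.03, 0.09, 0.05)

res <- cor.test(amount, loss)  # Pearson correlation by default

# cor.test() uses only pairs where both values are present.
complete_pairs <- sum(complete.cases(amount, loss))

r_est <- unname(res$estimate)   # sample correlation
df    <- unname(res$parameter)  # degrees of freedom = complete pairs - 2
```

With samples as large as the ones above, even a weak correlation yields a very small p-value, so the size of the estimate is more informative than its significance.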
Notes: Summary of the observations from the bivariate explorations. Answers to the project questions follow.
How did the feature(s) of interest vary with other features in the dataset? > LoanOriginalAmount vs EstimatedLoss shows denser clustering at lower values of each. This is interesting and needs to be probed further in conjunction with terms and rates. > EmploymentStatus, DebtToIncomeRatio, IncomeRange, LoanStatus, BorrowerState, CreditGrade and Term were the other variables observed. BorrowerAPR and EstimatedLoss show a strong positive correlation, which may be because loans at higher rates are harder to repay and therefore show greater losses when borrowers default.
One of the factors is Inquiries in Last 6 Months, which I had supposed would have a high correlation with Estimated Loss. However, the correlation is weak at best (about 0.28). The variable Term shows a weak negative correlation with Estimated Loss (about -0.11), contrary to my initial line of thought.
Estimated Loss versus Borrower APR was the strongest relation among those I investigated (correlation of about 0.95).
Notes: Based on the relationships seen in the bivariate plots section, we create a few multivariate plots to investigate more complex interactions between variables.
Next, we explore the relation, if any, between Estimated Loss and Loan Original Amount, with Income Range as a third variable. We study this using a scatter plot, with color representing the third variable. Since Estimated Loss and principal were roughly negatively related, we can probe whether income range had a role to play. The linear model is also shown. We note that losses fall as income range increases. As in our earlier analysis, this can stem from many plausible causes, such as more thorough background checks for higher loan amounts, greater payment ability accompanying higher borrowing ability, more steps or stronger action to reclaim larger loan amounts, or simply a greater incentive to repay larger loans to avoid crippling compound interest.
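The idea of encoding a third, categorical variable as point color can be sketched in base R as below. The data frame, its column names and the three income levels are hypothetical; the per-group means are added only as a numeric check of what the colors suggest.

```r
# Hypothetical data: amounts, losses and an ordered income grouping.
loans <- data.frame(
  amount = c(2000, 4000, 6000, 8000, 10000, 12000),
  loss   = c(0.14, 0.12, 0.09, 0.08, 0.05, 0.04),
  income = factor(c("low", "low", "mid", "mid", "high", "high"),
                  levels = c("low", "mid", "high"))
)

# In a non-interactive run, draw to a null device so no file is written.
if (!interactive()) pdf(NULL)
plot(loans$amount, loans$loss,
     col = as.integer(loans$income), pch = 19,  # one color per income level
     xlab = "Loan original amount", ylab = "Estimated loss")
legend("topright", legend = levels(loans$income),
       col = seq_along(levels(loans$income)), pch = 19)
abline(lm(loss ~ amount, data = loans))         # overall linear fit

# Mean estimated loss within each income level.
mean_loss <- tapply(loans$loss, loans$income, mean)
```

In this toy data the group means fall from "low" to "high", which is the pattern the colored scatter plot is meant to reveal.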
Next, we explore the relation, if any, between Estimated Loss and Loan Original Amount, with Monthly Loan Payment as a third variable. We study this using a scatter plot, with size representing the third variable. Since Estimated Loss and principal were roughly negatively related, we can probe whether monthly payments had a role to play. We can see that monthly payments increase with the original loan amount but do not contribute to any spikes in estimated loss. While this adds no new insight, it fits our inferences so far: the higher the loan amount, the lower the estimated loss.
Were there features that strengthened each other in terms of looking at your feature(s) of interest? > On plotting Estimated Loss and Borrower APR along with Credit Grade and Income Range in the two scatterplots, we can view our two primary variables of interest in the context of other variables such as credit grade and income range.
The most interesting plot was the relation between Estimated Loss, Loan Original Amount and Monthly Loan Payment. It showed clearly that small to medium loan amounts with small monthly payments had low losses. This is also supported by the fact that these small loans with low instalments are typical of medium-high income groups, as is evident from the preceding scatterplot (Relation between Estimated Loss, Loan Original Amount and Income Range).
Summary: Most interesting findings Section
It is interesting to note that some states have a very high number of borrowers while others have barely any. Of course, this observation is subject to the limitations of data sampling. Potential causes include uneven sampling of data and a lack of lenders and businesses. Since only a single categorical variable is being displayed, a bar plot is suitable.
This plot shows that as the APR increases, so do estimated losses. This is in line with the simple principle of compounding from finance and accounting, and the loan dataset clearly confirms it. As rates go higher, repayment becomes more difficult, causing a loss for the lender and greater instalments for the loan recipient. Since two numerical variables are being compared, the scatterplot is an obvious choice. The graph is supported by the correlation computed in prior sections (0.9495).
The relation between Estimated Loss, Loan Original Amount and Monthly Loan Payment is depicted using a scatterplot with color representing the third variable. As all three variables are numeric, a scatterplot is a suitable choice. It showed clearly that small to medium loan amounts with small monthly payments had low losses. This is also supported by the fact that these small loans with low instalments are typical of medium-high income groups, as is evident from the preceding scatterplot (Relation between Estimated Loss, Loan Original Amount and Income Range).
Notes: Due to the large number of variables, it was difficult to choose which ones to work with. My primary variables of interest were LoanOriginalAmount and EstimatedLoss. While the correlation is not very strong (about -0.43), visually one can see a weak, negatively sloped, polynomial-like curve. Some of the assumptions I had at the outset were negated or not strongly supported by the dataset. For instance, I had assumed that Inquiries in Last 6 Months would have a high correlation with Estimated Loss. However, the correlation is weak at best. Also, the variable Term shows a negative correlation with Estimated Loss, contrary to my initial line of thought. The insights from LoanOriginalAmount vs EstimatedLoss should be applied to Occupation as well, and then to Credit Lines, to see if any relation exists. Also, Borrower APR and Estimated Loss have a high correlation. This should be examined further alongside the trader and credit line variables to see if high-risk ventures correlate with loan defaulting.